1. INTRODUCTION

Big data and Internet finance (Fintech) are currently trending topics worldwide, and the internet
lending industry is known as peer-to-peer (P2P) lending. As a Fintech platform, P2P lending has a unique
transaction characteristic: it connects individual borrowers with individual lenders or investors, who
make credit agreements and complete transaction procedures directly through the online platform, without
commercial banks as intermediaries. Gradually, P2P lending has become a solution for small and medium
businesses seeking loan capital, and the loan amount is very large every year. LendingClub Corporation [1]
reported in its "Fourth Quarter and Full Year 2019 Results" that the loan amount reached US$12,290.1
million by the end of 2019. According to Stern et al. [2], Chinese government data show that China had
the largest number of P2P loan platforms in the investment market, with around 2,300 platforms as of
March 2017 and CNY 9.208 loan volume.

P2P lending presents both an opportunity and a challenge, including for China as a developed
economy. To a large extent, P2P lending meets China's current economic needs, but it also carries risks.
Financial risk includes liquidity risk caused by insufficient liquid funds, credit risk caused by
information asymmetry, and legal risk caused by unclear laws governing Fintech. In brief, the risk
characteristics of Fintech are more complex than those of conventional finance. In addition, because
Fintech is technology-based and virtual, it triggers special risks beyond conventional financial risks:
financial risks arise suddenly and spread quickly, and the resulting damage can be very serious and
uncontrollable. Challa et al. [3] noted that risk aversion is an interesting and very important topic
among investors, policymakers, and financial practitioners, and a subject of study for researchers.

Generally, research on and applications of loan evaluation in P2P lending platforms follow two main
directions: first, using credit scores to evaluate the credit risk of loans, and second, transforming
loan evaluation into a binary classification problem. A credit scorecard is a conventional loan
evaluation method. Chen and Han [4] explained that these scorecards are usually launched by the P2P
lending platforms themselves for business needs, for example, the Fair Isaac Corporation (FICO) score
and the LendingClub score. However, according to Malekipirbazari and Aksakalli [5], credit scorecards
cannot distinguish between defaulters and non-defaulters. As big data technology matures, many
researchers use machine learning techniques to predict whether a loan on a P2P lending platform will be
repaid or will become overdue. Light gradient boosting machine (LightGBM) is a machine learning
algorithm used for classification. LightGBM is an improved version of the gradient boosting framework
based on decision trees and the idea of "weak" learners. Since LightGBM was developed by Microsoft [6]
and introduced in 2016, several researchers have applied this big data machine learning algorithm in
various fields and obtained predictions with very high accuracy, fast computation, and good performance
in limiting over-fitting. Examples include web search, identification of miRNAs in breast cancer [7],
default prediction for P2P lending platforms [8]-[11], music recommendation [12], acoustic scene
classification [13], smart grid load forecasting [14], estimation of reference evapotranspiration in
agriculture and hydrology [15], construction cost prediction [16], Fintech customer loyalty prediction
[17], and stream processing prediction [18].

LightGBM is known for fast learning, speed when handling big data, high accuracy, good model
precision, and low memory consumption, so it is considered more effective and efficient than other
machine learning techniques [8], [19]. According to Rao et al. [20], feature selection is a significant
phase for large data sets in tasks such as image classification, cluster analysis, data mining, pattern
recognition, and image capture [21], [22]. Many methods have been proposed, improved, and discussed for
feature selection. Alickovic and Subasi [23]-[24] improved the whale optimization algorithm (WOA) to
optimize the features in a dataset. Zhu et al. [25] presented an unsupervised spectral feature selection
method that preserves the local and global structure of the features while redundant features are
removed. Wan and Freitas [26] evaluated hierarchical methods for optimizing feature selection on
aging-related gene data sets. Rao et al. [20] used an artificial bee colony and a gradient boosting
decision tree to select features from eight UCI data sets; the experimental results showed that their
method reduces the dimensionality of the data sets and achieves superior classification accuracy. Ghosh
et al. [27] improved a wrapper-filter feature selection method based on ant colony optimization to
reduce computational complexity.

Based on the previous research described above, this study focuses on increasing prediction accuracy
through feature selection. We therefore use two swarm algorithms, i.e., the ant colony optimization
(ACO) algorithm and the bee colony optimization (BCO) algorithm, for feature selection, and LightGBM as
the tool to evaluate P2P lending data sets. This study aims to determine the performance of the two
swarm algorithms in the feature selection process and the resulting prediction performance of the
LightGBM algorithm. In addition, we use the synthetic minority oversampling technique (SMOTE) to address
class imbalance in the data. This technique is also expected to improve prediction accuracy, as shown by
Faris et al. [28] in predicting company bankruptcy with highly imbalanced classes.


2. RELATED KNOWLEDGE AND THEORY 
2.1. Lending club 

Lending Club has had an impact on risk management. Only a small share of loan applications is
approved, around 10% of all applications. In addition, Lending Club assigns grades from A to G to
classify loans by risk. The main role of Lending Club is to make it easier for borrowers and lenders or
investors to transact and to provide related information. However, this transaction model has many
problems in practice; for example, borrowers may fail to repay loans according to the agreement, so that
investors incur losses. Loan interest rates are determined according to the credit and term of the loan.
The Lending Club business pattern is shown in Figure 1.
Figure 1. Lending club business pattern [29]


2.2. Bee colony optimization 

Bee colony optimization (BCO) is a bio-inspired method based on the nectar-gathering behavior of bees
[30]. Akay and Karaboga [31] stated that the global optimum is determined by the neighborhood search of
each bee. Wen et al. [32] stated that the artificial bee colony method is able to locate the global
optimum of global optimization problems. Compared with other bio-inspired heuristic algorithms, BCO has
several strengths: a simple structure, few control parameters, and ease of implementation [33]. Because
of these strengths, BCO has attracted researchers to study and apply it in various fields.
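
As a rough illustration of the neighborhood search described above, the following Python sketch applies the candidate move commonly used in artificial bee colony methods, v_j = x_j + phi (x_j - x_kj), followed by greedy selection. The toy objective `sphere`, the function name `abc_neighborhood_step`, and all parameter values are assumptions introduced here for illustration; they are not the configuration used in this study.

```python
import random


def sphere(x):
    """Toy objective to minimize (an assumption for illustration)."""
    return sum(v * v for v in x)


def abc_neighborhood_step(population, objective, rng):
    """One employed-bee pass: each food source tries one neighbor move
    and keeps it only if the objective improves (greedy selection)."""
    new_population = []
    for i, source in enumerate(population):
        # Pick a different food source k and one dimension j at random.
        k = rng.choice([idx for idx in range(len(population)) if idx != i])
        j = rng.randrange(len(source))
        phi = rng.uniform(-1.0, 1.0)
        candidate = list(source)
        candidate[j] = source[j] + phi * (source[j] - population[k][j])
        # Greedy selection keeps the better of the old and new positions.
        if objective(candidate) < objective(source):
            new_population.append(candidate)
        else:
            new_population.append(list(source))
    return new_population


rng = random.Random(0)
population = [[rng.uniform(-5, 5) for _ in range(3)] for _ in range(10)]
initial_best = min(sphere(p) for p in population)
for _ in range(200):
    population = abc_neighborhood_step(population, sphere, rng)
final_best = min(sphere(p) for p in population)
```

Because a food source is replaced only when the candidate is better, the best objective value in the population can never get worse from one pass to the next.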


2.3. Ant colony optimization 

Ant colony optimization (ACO) is a bio-inspired algorithm based on the behavior of ant colonies [34].
Ants cannot see. However, through indirect communication, they can find the shortest route from the nest
to a food source [35]. The mechanism by which ants modify their environment (by depositing pheromone) to
influence the behavior of other ants is called stigmergy. ACO algorithms are modeled on this foraging
behavior. The variants most often discussed and applied are the ant system (AS), the ant colony system
(ACS), and the max-min ant system (MMAS). When an optimization problem is solved with ACO, several
artificial ants construct solutions iteratively. In each iteration, the ants deposit an amount of
pheromone proportional to the quality of the solution. Tabakhi and Moradi [36] explained that, in each
iteration, an ant computes a set of feasible extensions of its current partial solution and chooses one
of them depending on two factors, i.e., local heuristics and prior knowledge. Three aspects need to be
addressed: graph representation, heuristic desirability, and the pheromone update rule.
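
The two ingredients named above, a choice guided by heuristic desirability and pheromone, and a pheromone update proportional to solution quality, can be sketched as follows. This is a minimal example in the style of the ant system, not the implementation used in this study: the feature names, the heuristic values, the toy quality function, and the parameters `alpha`, `beta`, and `rho` are all assumptions.

```python
import random


def choose_next(candidates, pheromone, heuristic, alpha, beta, rng):
    """Pick one candidate with probability proportional to
    pheromone**alpha * heuristic**beta (roulette-wheel selection)."""
    weights = [(pheromone[c] ** alpha) * (heuristic[c] ** beta) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def update_pheromone(pheromone, solutions, rho):
    """Evaporate all trails, then deposit pheromone in proportion to
    the quality of each ant's solution."""
    for c in pheromone:
        pheromone[c] *= (1.0 - rho)
    for chosen, quality in solutions:
        for c in chosen:
            pheromone[c] += quality


# Hypothetical heuristic desirability of four features (e.g. relevance scores).
heuristic = {"f1": 0.9, "f2": 0.1, "f3": 0.8, "f4": 0.2}
pheromone = {name: 1.0 for name in heuristic}
rng = random.Random(1)

for _ in range(30):                      # iterations
    solutions = []
    for _ in range(5):                   # ants per iteration
        remaining = list(heuristic)
        chosen = []
        for _ in range(2):               # each ant selects two features
            pick = choose_next(remaining, pheromone, heuristic, 1.0, 2.0, rng)
            chosen.append(pick)
            remaining.remove(pick)
        # Toy quality: the sum of heuristic values of the chosen subset.
        quality = sum(heuristic[c] for c in chosen)
        solutions.append((chosen, quality))
    update_pheromone(pheromone, solutions, rho=0.2)

ranked = sorted(pheromone, key=pheromone.get, reverse=True)
```

In a wrapper setting, the toy quality function would typically be replaced by the validation score of a classifier trained on the chosen subset.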


2.4. Light gradient boosting machine 

Light gradient boosting machine (LightGBM) is a fast and efficient gradient boosted decision tree
(GBDT) algorithm released as an open-source boosting framework by Microsoft Research Asia (MSRA) in
2016. It is used for ranking, classification, regression, and many other machine learning tasks, and it
supports efficient parallel training. In contrast to extreme gradient boosting (XGBoost), LightGBM uses
a histogram-based algorithm to speed up training and reduce memory use, and it adopts a leaf-wise growth
strategy with depth constraints. The basic idea of the histogram algorithm is to discretize continuous
floating-point feature values into k bins and build a histogram of width k. LightGBM does not need to
store large pre-sorted results; it can store the bin indices as 8-bit integers, which reduces memory
consumption to 1/8 of the original. This coarse partition does not reduce the accuracy of the LightGBM
model. LightGBM is a boosting method that proceeds in three steps. For simplicity, X denotes a
pre-processed streaming data set.
Step 1. Initialize the weak learner by (1).

$$f_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \tag{1}$$

where $f(x)$ is the basis function of the weak learner, $L(y, c) = L(y, f(x)) = (y - f(x))^2$ is the
loss function, and $n$ is the number of samples.

Step 2. Compute the weak learners iteratively, $M$ times.

a. For each sample $x_i \in X$, $\forall i = 1, 2, \ldots, n$, calculate the negative gradient of the
loss function evaluated at the existing model using (2).

$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)} \tag{2}$$


Indonesian J Elec Eng & Comp Sci, Vol. 28, No. 2, November 2022: 1002-1011 


Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752


where $r_{mi}$ is the negative gradient of the loss function.
b. The resulting residual $r_{mi}$ is taken as the new target value of the sample. Fit a regression
tree to

$$\{(x_1, r_{m1}), \ldots, (x_n, r_{mn})\}$$

to obtain a new regression tree $f_m(x)$.

c. Calculate the best-fit value for each leaf region $j = 1, 2, \ldots, J$. The value $c_{mj}$ in (3) is
found by line search so that the predicted value of leaf region $R_{mj}$ minimizes the loss function.

$$c_{mj} = \arg\min_{c} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c), \quad m = 1, \ldots, M \tag{3}$$

d. Update the strong learner by using (4).

$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj}) \tag{4}$$

where $f_m(x)$ is the current learner, $f_{m-1}(x)$ is the previous learner, and $I$ is the indicator
function.
Step 3. Determine the final regression tree by using (5).

$$F(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj}) \tag{5}$$
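
The three steps can be traced in a small pure-Python sketch. It is a plain gradient boosting loop with squared loss and depth-1 regression trees (stumps) on a single feature, written only to mirror Steps 1 to 3; it does not reproduce LightGBM's histogram binning or leaf-wise growth. The helper names `fit_stump`, `boost`, and `predict` and the toy data are assumptions.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on one feature and return the
    split threshold with the lowest squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t)
    return best[1]


def boost(xs, ys, rounds):
    # Step 1: the constant minimizing squared loss is the mean of y.
    f0 = sum(ys) / len(ys)
    preds = [f0] * len(ys)
    trees = []
    for _ in range(rounds):
        # Step 2a: negative gradient of (y - f)^2 with respect to f.
        grads = [2.0 * (y - p) for y, p in zip(ys, preds)]
        # Step 2b: fit a regression tree (here a stump) to the gradients.
        t = fit_stump(xs, grads)
        # Step 2c: for squared loss, the loss-minimizing leaf value is the
        # mean residual y - f_{m-1}(x) inside each leaf region.
        left = [y - p for x, y, p in zip(xs, ys, preds) if x <= t]
        right = [y - p for x, y, p in zip(xs, ys, preds) if x > t]
        c_left, c_right = sum(left) / len(left), sum(right) / len(right)
        # Step 2d: add the new tree to the current model.
        preds = [p + (c_left if x <= t else c_right) for x, p in zip(xs, preds)]
        trees.append((t, c_left, c_right))
    return f0, trees


def predict(model, x):
    # Step 3: initial constant plus the leaf value of every tree for x.
    f0, trees = model
    return f0 + sum(cl if x <= t else cr for t, cl, cr in trees)


# Toy one-dimensional data (hypothetical, for illustration only).
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 3.0, 3.1, 2.9]
model = boost(xs, ys, rounds=20)
mean_y = sum(ys) / len(ys)
mse_constant = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

On this toy data the training error of the boosted model falls below that of the constant initial learner, since each round chooses leaf values that cannot increase the squared loss.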


The significance of a feature is calculated as the normalized total reduction of the criterion brought
by that feature. It is also known as the Gini significance. The Gini index is denoted by Gini(p) in (6).

$$\mathrm{Gini}(p) = \sum_{l=1}^{L} p_l (1 - p_l) = 1 - \sum_{l=1}^{L} p_l^2 \tag{6}$$

where $L$ is the number of labels and $p_l$ is the weight of label $l$.
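
As a small worked example of (6), the sketch below computes the Gini index of a list of labels and the reduction obtained by one split, which is the quantity accumulated per feature when importance is measured by total criterion reduction. The function names and the toy labels are assumptions for illustration.

```python
from collections import Counter


def gini(labels):
    """Gini index 1 - sum_l p_l^2 of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(parent, left, right):
    """Decrease in Gini index obtained by splitting parent into left and right."""
    n = len(parent)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(parent) - weighted


parent = [0, 0, 0, 1, 1, 1]
parent_gini = gini(parent)
perfect_split_gain = gini_reduction(parent, [0, 0, 0], [1, 1, 1])
useless_split_gain = gini_reduction(parent, [0, 1], [0, 0, 1, 1])
```

A balanced two-class node has a Gini index of 0.5; a split that separates the classes completely removes all of it, while a split that leaves both children balanced removes none.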


3. RESEARCH METHOD 

The research method for predicting loan default in P2P lending consists of several phases, i.e.,
dataset pre-processing, data oversampling, ensemble classification, and performance evaluation. The
overall research framework is shown in Figure 2.
Figure 2. Framework of study


3.1. Dataset pre-processing

Data pre-processing is a sequence of steps applied to prepare the dataset for analysis and modeling.
This phase is therefore considered an important step in the data mining process [37]. In this study,
data pre-processing includes data cleaning, data normalization, and data retrieval. In data cleaning,
missing values, inconsistencies, and noise (e.g., incorrect data input) are eliminated [38]-[41]. We use
the Lending Club data set for the 2019 quarter, downloaded from Kaggle.com, which contains 20,875,146
original user loans with 18 attributes. During pre-processing, missing values are filled via
interpolation mode and duplicate or ineffective attributes are removed, leaving six attributes. Table 1
shows the attributes used in the experiment.


Table 1. The selected attributes and pre-processing 


Feature name | Description and pre-processing | Type | Algorithm
amount_borrowed | the principal amount of the loan upon which interest will accrue | numeric | A, B, C
borrower_rate | the interest rate at which money may be borrowed | numeric | A, B, C
installment | the monthly payment owed by the borrower if the loan originates | numeric | A, B, C
principal_paid | a payment toward the original amount of a loan that is owed | numeric | A, B, C
interest_paid | a payment of interest on a loan or mortgage | numeric | A, B, C
grade | Lending Club assigned loan grade | nominal | B
term | the loan repayment term, with the value representing 36 months converted from a binary number by discretization | numeric | A
loan_status | the source of our answer to the core question of whether people are paying the loans they take out | nominal | C

Note: A = LightGBM without swarm algorithms, B = LightGBM with BCO algorithm, C = LightGBM with
ACO algorithm


3.2. Synthetic minority oversampling technique 
Following the data pre-processing in Section 3.1, the large difference between the numbers of normal
and default cases in the target variable 'loan_status' can complicate model learning. SMOTE is an
oversampling method for imbalanced data sets. Its rationale is as follows [29]:

a) For each minority-class sample, compute its K nearest neighbors, using the Euclidean distance as the
metric.

b) Set the sampling rate according to the imbalance ratio; for each minority-class sample $x_i$,
randomly select several samples from its K nearest neighbors.

c) Let $x_n$ be a selected neighbor. For each randomly selected neighbor $x_n$, a new sample is
generated from the original sample using (7).

$$x_{new} = x_i + \mathrm{rand}(0, 1) \times (x_n - x_i) \tag{7}$$

By iterating over every sample $x_i$, the minority class is enlarged until an ideal class ratio is
reached.
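
The three SMOTE steps can be illustrated with a short pure-Python sketch. It follows the usual interpolation form, in which the new point lies on the segment between a minority sample and one of its nearest minority neighbors. The function name `smote`, the parameter names, and the toy points are assumptions, and this is not the implementation used in the experiments.

```python
import math
import random


def smote(minority, n_new, k, rng):
    """Generate n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # a) k nearest neighbors of x within the minority class (Euclidean).
        neighbors = sorted((p for p in minority if p is not x),
                           key=lambda p: math.dist(x, p))[:k]
        # b) pick one of those neighbors at random.
        x_n = rng.choice(neighbors)
        # c) interpolate: x_new = x + rand(0, 1) * (x_n - x).
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(x, x_n)])
    return synthetic


# Hypothetical minority-class points with two features each.
minority = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=2, rng=random.Random(0))
```

Each synthetic point is a convex combination of two existing minority samples, so it stays inside the region already occupied by the minority class.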


3.3. Feature selection 

First, we define the "installment" feature to represent the user's monthly payment as a percentage of
their monthly revenue. The greater the "installment" value, the heavier the borrower's burden and the
more likely the loan is to default. Second, we perform feature abstraction. We encode the loan statuses
'Current' and 'Completed' as normal=0, and 'Default', 'Charge off', and 'Canceled' as default=1. Next,
we plot loan_status: 89% of loan_status is "Default" and only the remaining 11% is "Normal". This result
indicates a serious imbalance in the dataset. Third, after scaling the features, we perform feature
selection. Features with high relevance or correlation are kept, and irrelevant or weakly correlated
features are removed. This elimination reduces the difficulty of training. We use the swarm algorithms,
i.e., ACO and BCO, to select the 6 features with the strongest correlation with the target variable,
removing features step by step to reduce the dimension from 18 variables to 6. The Pearson correlation
graph of the 18 features is shown in Figure 3.
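
The status encoding and the class-share check described above can be written compactly as follows. The status strings follow the text; the helper names and the five sample records are hypothetical and are not taken from the Lending Club data.

```python
from collections import Counter

# Mapping of loan statuses to the binary target: normal = 0, default = 1.
STATUS_TO_LABEL = {
    "Current": 0, "Completed": 0,
    "Default": 1, "Charge off": 1, "Canceled": 1,
}


def encode_status(statuses):
    """Translate status strings into 0/1 labels."""
    return [STATUS_TO_LABEL[s] for s in statuses]


def class_share(labels):
    """Return the fraction of samples carrying each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in sorted(counts)}


# Hypothetical records for illustration only.
sample = ["Current", "Default", "Completed", "Charge off", "Current"]
labels = encode_status(sample)
shares = class_share(labels)
```

A strongly uneven result from `class_share` is the signal that an oversampling step such as SMOTE is worth applying before training.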

In this first dimension reduction, the redundant features are identified and removed using the Pearson
correlation graph together with the swarm algorithm used. The feature dimensions reduced from 18 to 6
are shown in Figure 4. Figure 4(a) shows that the features selected by the BCO algorithm are
amount_borrowed, borrower_rate, installment, principal_paid, and grade, and Figure 4(b) shows that the
features selected by the ACO algorithm are amount_borrowed, borrower_rate, installment, principal_paid,
and loan_status.
Figure 4. Pearson correlation of 6 features, (a) the feature selection of the BCO algorithm and
(b) the feature selection of the ACO algorithm

The population correlation coefficient is defined as the covariance of two variables divided by the
product of their standard deviations. The sample Pearson correlation coefficient is obtained by
estimating the covariance and the standard deviations from the sample. Finally, we use the swarm
algorithms, i.e., ACO and BCO, to select the important features and reduce the learning difficulty so as
to optimize the model calculation.
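
For reference, the sample Pearson correlation coefficient and a simple correlation-based ranking can be sketched as below. The ranking shown is a plain filter that keeps the features with the largest absolute correlation with the target; it stands in for, and is not, the ACO or BCO search used in this study. The function names and the toy columns are assumptions.

```python
import math


def pearson(xs, ys):
    """Sample Pearson correlation: covariance divided by the product of the
    standard deviations (assumes neither variable is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def top_features(columns, target, k):
    """Rank features by absolute correlation with the target and keep k."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical feature columns and target values.
columns = {"x1": [1, 2, 3, 4], "x2": [1, -1, 1, -1]}
target = [2, 4, 6, 8]
r_x1 = pearson(columns["x1"], target)
selected = top_features(columns, target, 1)
```

The common normalization factors of the covariance and the standard deviations cancel, so the sums above can be used without dividing by the sample size.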


3.4. The evaluation performance model 

In this study, we use three measures, i.e., accuracy, AUC, and ROC, to evaluate the performance of the
proposed model. Accuracy is the ratio of the number of correctly classified samples to the total number
of samples in a given test data set, as shown in (8).

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives.

Recall is the fraction of all positive instances (defaults) that the classifier correctly categorizes as
positive, also known as the TP ratio. Precision is the fraction of instances predicted as positive that
are truly positive. The balanced F score, or F1-score, is the harmonic mean of precision and recall.
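
These definitions translate directly into code. The sketch below counts the confusion-matrix entries and derives accuracy, precision, recall, and F1-score from them; the function names and the eight example labels are assumptions for illustration, with 1 denoting default.

```python
def confusion(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1; assumes at least one actual
    positive and at least one predicted positive."""
    tp, tn, fp, fn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
accuracy, precision, recall, f1 = metrics(y_true, y_pred)
```

In this example there are three true positives, three true negatives, one false positive, and one false negative, so all four measures equal 0.75.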


3.4.1. Receiver operating characteristic curve 

In statistics, the receiver operating characteristic (ROC) curve is a two-dimensional plot that
illustrates the performance of a binary classifier. The ROC curve is created by plotting the true
positive ratio (TPR) against the false positive ratio (FPR) at various threshold settings, using (9).
Intuitively, this curve represents the performance of the classifier.

$$TPR = \frac{TP}{TP + FN}, \quad FPR = \frac{FP}{FP + TN} \tag{9}$$
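
A minimal sketch of how the ROC points arise from a threshold sweep is given below. Every distinct score is used once as the decision threshold, and TPR and FPR are computed as in (9). The function name `roc_points` and the six scored samples are assumptions for illustration.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold
    over every distinct score, from the strictest to the most permissive."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points


# Hypothetical labels (1 = default) and predicted default scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
```

The sweep starts at the origin and ends at (1, 1), where the threshold is low enough that every sample is predicted positive.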


3.4.2. AUC value 

The AUC is the area under the ROC curve on the test data set. Suppose that the ROC curve is formed by
connecting, in sequence, the points with coordinates $\{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$.
The AUC can then be computed using (10).

$$AUC = \frac{1}{2} \sum_{i=1}^{m-1} (x_{i+1} - x_i)(y_i + y_{i+1}) \tag{10}$$

where the AUC value lies in [0.5, 1.0]; an AUC close to 1.0 indicates that the classifier performs
well.
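
Equation (10) is the trapezoidal rule applied to consecutive ROC points, as the following sketch shows. The function name `auc_trapezoid` and the example points are assumptions for illustration.

```python
def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points,
    using the trapezoidal rule of (10)."""
    pts = sorted(points)
    return 0.5 * sum((x2 - x1) * (y1 + y2)
                     for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Hypothetical ROC points (FPR, TPR) for a small test set.
roc = [(0.0, 0.0), (0.0, 1 / 3), (0.0, 2 / 3), (1 / 3, 2 / 3),
       (1 / 3, 1.0), (2 / 3, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(roc)
chance_auc = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])
```

For these points the area is 8/9, and the diagonal from (0, 0) to (1, 1) gives 0.5, the value expected from random guessing.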


4. RESULTS AND DISCUSSION 

In this research, the LightGBM classifier improved through feature engineering, i.e., feature
selection using the swarm algorithms ACO and BCO, is evaluated using several measures, i.e., accuracy,
AUC, F1-score, recall, and ROC curves. The results are shown in Table 2.


Table 2. The evaluation metrics comparison of the proposed model 


Classifier model Accuracy AUC x uae T 0 Recall T Rank 
LightGBM+ACO 95.64 % 0.956 0.97 0.97 0.96 0.97 1
LightGBM+BCO 94.70 % 0.947 0.93 0.93 0.94 0.93 2
LightGBM 94.38 % 0.943 0.90 0.90 0.90 0.92 3


Table 2 shows that the performance of the LightGBM algorithm increases after feature selection with
the swarm algorithms is applied. LightGBM+ACO outperforms both LightGBM+BCO and LightGBM without a swarm
algorithm. The precision and recall of the LightGBM-based prediction models, with or without a swarm
algorithm, are all at least 0.90. These values suggest that the model generalizes well. The ROC curves
are illustrated in Figure 5. The figure shows that the closer the ROC curve is to the upper-left corner,
the higher the prediction rate of the model. The point of the ROC curve closest to the upper-left corner
gives the best classification with the lowest error, i.e., the threshold with the fewest false positives
and false negatives in total. From the curves, we can conclude that LightGBM+swarm is superior to
LightGBM.
Figure 5. The ROC curve performance comparison of LightGBM+ACO, LightGBM+BCO and LightGBM


5. CONCLUSION 

In this study, the LightGBM algorithm is improved through feature engineering, i.e., feature
selection using the BCO and ACO algorithms, to create a P2P loan evaluation model, in particular for
predicting credit defaults. The experiment uses data sets from kaggle.com to show that the improved
LightGBM is successful. The feature selection process selects 6 of 18 features. The SMOTE method is
applied to address the class imbalance in the dataset, and a series of operations such as data cleaning
and dimension reduction is performed. The experimental results show that the LightGBM algorithm has been
successfully improved: the prediction accuracy is 95.64% for LightGBM+ACO, 94.70% for LightGBM+BCO, and
94.38% for LightGBM. These results also demonstrate strong performance in predicting loan default and
good generalization.